Implement block prefiltering for Joins of a lazy result + IndexScan #1647

Merged (20 commits) on Dec 3, 2024

Collaborator

@RobinTF RobinTF commented Nov 27, 2024

With this PR, a join between a lazy intermediate result and an IndexScan prefilters the blocks of the IndexScan before reading them. Before this PR, this prefilter was only applied to joins between two IndexScans and between an IndexScan and a fully materialized intermediate result.
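
A minimal sketch of the idea, not the code from this PR: assuming both the lazy input and the scan are sorted by the join column, and that each block's metadata exposes its smallest and largest join-column value, every incoming chunk can be used to decide which blocks need to be read at all. `BlockRange`, `BlockPrefilter`, `matchingBlocks`, and `numPendingBlocks` are hypothetical names for illustration only; the struct reviewed further down works on `Result::Generator` and `CompressedBlockMetadata` instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the metadata of one IndexScan block:
// the smallest and largest join-column value stored in that block.
struct BlockRange {
  std::uint64_t first;
  std::uint64_t last;
};

// Hypothetical helper (not QLever's API). It consumes the join column of a
// lazy result chunk by chunk and reports which blocks may contain a match.
class BlockPrefilter {
 public:
  // `blocks` must be sorted by the join column.
  explicit BlockPrefilter(std::vector<BlockRange> blocks)
      : blocks_(std::move(blocks)) {}

  // Takes one sorted chunk of join-column values. Returns the indices of the
  // not yet decided blocks that contain at least one candidate value.
  std::vector<std::size_t> matchingBlocks(
      const std::vector<std::uint64_t>& chunk) {
    assert(std::is_sorted(chunk.begin(), chunk.end()));
    std::vector<std::size_t> result;
    auto it = chunk.begin();
    while (next_ < blocks_.size()) {
      const BlockRange& block = blocks_[next_];
      // Find the first chunk value that is not smaller than the block start.
      it = std::lower_bound(it, chunk.end(), block.first);
      if (it == chunk.end()) {
        // The whole chunk lies before this block. A later chunk might still
        // match, so the block stays pending.
        break;
      }
      if (*it <= block.last) {
        result.push_back(next_);  // The block has to be read.
      }
      // Otherwise the input has already moved past the block: skip it.
      ++next_;
    }
    return result;
  }

  // Number of blocks for which no decision has been made yet.
  std::size_t numPendingBlocks() const { return blocks_.size() - next_; }

 private:
  std::vector<BlockRange> blocks_;
  std::size_t next_ = 0;  // Index of the first undecided block.
};
```

Because the input is lazy, a block can only be discarded once a chunk has moved past its range; blocks that lie entirely behind the current chunk stay pending until the next chunk arrives.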

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 97.79736% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.57%. Comparing base (5c0d28a) to head (9934c62).
Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/engine/IndexScan.cpp | 97.87% | 2 Missing and 1 partial ⚠️ |
| src/engine/Join.cpp | 97.36% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1647      +/-   ##
==========================================
+ Coverage   89.53%   89.57%   +0.04%     
==========================================
  Files         381      381              
  Lines       36649    36792     +143     
  Branches     4146     4170      +24     
==========================================
+ Hits        32813    32957     +144     
- Misses       2520     2522       +2     
+ Partials     1316     1313       -3     

Member

@joka921 joka921 left a comment

A first somewhat thorough pass on the code (not yet the tests).
I think I still found a bug. If you agree, fix it, and add a test that proves which of us was right.

Member

@joka921 joka921 left a comment

Some additional comments.

Comment on lines +480 to +496
rti.numRows_ = metadata.numElementsYielded_;
rti.totalTime_ = metadata.blockingTime_;
rti.addDetail("num-blocks-read", metadata.numBlocksRead_);
rti.addDetail("num-blocks-all", metadata.numBlocksAll_);
rti.addDetail("num-elements-read", metadata.numElementsRead_);

// Add more details, but only if the respective value is non-zero.
auto updateIfPositive = [&rti](const auto& value, const std::string& key) {
  if (value > 0) {
    rti.addDetail(key, value);
  }
};
updateIfPositive(metadata.numBlocksSkippedBecauseOfGraph_,
                 "num-blocks-skipped-graph");
updateIfPositive(metadata.numBlocksPostprocessed_,
                 "num-blocks-postprocessed");
updateIfPositive(metadata.numBlocksWithUpdate_, "num-blocks-with-update");
Member

Can you also make this part a member function of LazyScanMetadata?
That will make it easier to maintain when we add additional members to it.

Collaborator Author

There is an abstraction that can be moved into LazyScanMetadata, but the part that calls updateRuntimeInformationWhenOptimizedOut has to be part of a class related to Operation. The compressed relation reader class currently does not include RuntimeInformation, so a new header that isn't really related to this class would need to be added.

Member

If you don't feel that you have the time right now, limit yourself to the comments that are important for the main functionality of this PR.

The real solution here is that neither the LazyScanMetadata nor the CompressedBlockMetadata should be part of the large CompressedRelation.h file.
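
One possible way to reconcile the two positions above, sketched under assumptions: the metadata struct exposes its counters through a callback, so that it does not need to include the `RuntimeInformation` header, and the `Operation` side passes in a lambda that forwards to `rti.addDetail`. The field names are taken from the snippet under review; their type (`std::size_t`) and the names `exportDetails` and `collectDetails` are hypothetical. The `numRows_` and `totalTime_` assignments from the snippet are left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-in for the metadata struct. Only the counters from the
// reviewed snippet are modeled, and their type is an assumption.
struct LazyScanMetadata {
  std::size_t numBlocksRead_ = 0;
  std::size_t numBlocksAll_ = 0;
  std::size_t numElementsRead_ = 0;
  std::size_t numBlocksSkippedBecauseOfGraph_ = 0;
  std::size_t numBlocksPostprocessed_ = 0;
  std::size_t numBlocksWithUpdate_ = 0;

  // Hypothetical member function: report all details via `addDetail(key,
  // value)`. The struct does not need to know who consumes the details.
  template <typename AddDetail>
  void exportDetails(AddDetail&& addDetail) const {
    addDetail("num-blocks-read", numBlocksRead_);
    addDetail("num-blocks-all", numBlocksAll_);
    addDetail("num-elements-read", numElementsRead_);
    // The remaining details are only reported if they are non-zero.
    auto addIfPositive = [&addDetail](const char* key, std::size_t value) {
      if (value > 0) {
        addDetail(key, value);
      }
    };
    addIfPositive("num-blocks-skipped-graph", numBlocksSkippedBecauseOfGraph_);
    addIfPositive("num-blocks-postprocessed", numBlocksPostprocessed_);
    addIfPositive("num-blocks-with-update", numBlocksWithUpdate_);
  }
};

// Usage sketch: collect the details into a plain map. In the engine, the
// lambda would forward to `rti.addDetail(key, value)` instead.
inline std::map<std::string, std::size_t> collectDetails(
    const LazyScanMetadata& metadata) {
  std::map<std::string, std::size_t> details;
  metadata.exportDetails(
      [&details](const std::string& key, std::size_t value) {
        details[key] = value;
      });
  // The three unconditional keys are always present.
  assert(details.count("num-blocks-read") == 1);
  return details;
}
```

With this shape, adding a new counter only touches `LazyScanMetadata`, which is the maintainability point raised in the review.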

Comment on lines 503 to 515
Result::Generator generator_;
// The column index of the join column in the tables yielded by the generator.
ColumnIndex joinColumn_;
// Metadata and blocks of this index scan.
Permutation::MetadataAndBlocks metaBlocks_;
// The iterator of the generator that is currently being consumed.
std::optional<Result::Generator::iterator> iterator_ = std::nullopt;
// The index of the last matching block that was found using the join column.
std::optional<size_t> lastBlockIndex_ = std::nullopt;
// Values returned by the generator that have not been re-yielded yet.
std::deque<Result::IdTableVocabPair> prefetchedValues_{};
// Metadata of blocks that still need to be read.
std::vector<CompressedBlockMetadata> pendingBlocks_{};
Member

Please reorder them as follows:

  • generator_
  • iterator_
  • joinColumn_
  • prefetchedValues_
  • pendingBlocks_
  • lastBlockIndex_
  • the rest.

Then they are grouped correctly and easier to read.

Collaborator Author

The order is currently mainly a result of moving all non-default-initialized values to the top.

Member

Okay, in that case, for such a complex struct, write an explicit constructor that takes the arguments you need. Then you can order the members in a semantically logical way.
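
A rough sketch of that suggestion, using stand-in types so that it compiles on its own: once the struct has an explicit constructor, the members no longer have to be ordered for aggregate initialization and can be grouped by meaning. `PrefilterState`, `Chunk`, `BlockMetadata`, `input_`, `position_`, and `makeExampleState` are hypothetical; in the reviewed code the corresponding types are `Result::Generator`, its iterator, `Result::IdTableVocabPair`, `CompressedBlockMetadata`, and `Permutation::MetadataAndBlocks`.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the engine types.
using Chunk = std::vector<int>;
using BlockMetadata = int;

struct PrefilterState {
  // The lazy input and the position inside it.
  std::vector<Chunk> input_;                            // stands in for generator_
  std::optional<std::size_t> position_ = std::nullopt;  // stands in for iterator_
  std::size_t joinColumn_;
  // Buffers that are filled while the input is consumed.
  std::deque<Chunk> prefetchedValues_{};
  std::vector<BlockMetadata> pendingBlocks_{};
  std::optional<std::size_t> lastBlockIndex_ = std::nullopt;
  // The rest: metadata and blocks of the index scan.
  std::vector<BlockMetadata> metaBlocks_;

  // Only the values without a default are passed in. The initializer list
  // follows the declaration order, so the grouping above causes no warnings.
  PrefilterState(std::vector<Chunk> input, std::size_t joinColumn,
                 std::vector<BlockMetadata> metaBlocks)
      : input_(std::move(input)),
        joinColumn_(joinColumn),
        metaBlocks_(std::move(metaBlocks)) {}
};

// Usage sketch: construct the state from the three required values.
inline PrefilterState makeExampleState() {
  std::vector<Chunk> input{{1, 2}, {3}};
  PrefilterState state(std::move(input), 0,
                       std::vector<BlockMetadata>{10, 11, 12});
  // Everything else starts out in its default state.
  assert(!state.position_.has_value() && state.pendingBlocks_.empty());
  return state;
}
```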

@@ -29,24 +29,14 @@ class Join : public Operation {
vector<float> _multiplicities;

public:
// `allowSwappingChildrenOnlyForTesting` should only ever be changed by tests.
Member

How hard is it to get rid of the tests that require this?

Member

See below.

Collaborator Author

Very hard. The other constructor with the "only for testing" tag was previously used, but I had to change it to make it work with the new function signatures, to the point where it pretty much ended up as a second constructor that does almost the same thing.

if (allowSwappingChildrenOnlyForTesting) {
  std::swap(t1, t2);
  std::swap(t1JoinCol, t2JoinCol);
}
Member

Which tests does this break, and can they be fixed as part of this PR?

Member

I don't have a super strong opinion here. You can leave it for now, so that this
becomes mergeable in a finite amount of time :)

Collaborator Author

All tests that use the "join lambda", see test/util/JoinHelpers.h.

Member

@joka921 joka921 left a comment

Thank you very much!

@joka921 joka921 changed the title from "Implement prefiltering for lazy result + IndexScan" to "Implement block prefiltering for Joins of a lazy result + IndexScan" Dec 3, 2024
@joka921 joka921 merged commit 44e2ba8 into ad-freiburg:master Dec 3, 2024
23 checks passed
@RobinTF RobinTF deleted the prefiltering branch December 3, 2024 15:45
hannahbast pushed a commit that referenced this pull request Dec 4, 2024
This fixes a bug introduced in #1647, which led to a failure of `AD_CORRECTNESS_CHECK(!prefetchedValues.empty() || innerState->doneFetching_);` in `IndexScan::createPrefilteredJoinSide`.